Results 1 - 20 of 41,445
1.
Nat Commun ; 15(1): 3108, 2024 Apr 10.
Article in English | MEDLINE | ID: mdl-38600080

ABSTRACT

Fruit senescence is a complex physiological process, and the variety of cell types within the pericarp makes it highly challenging to elucidate their individual roles. In this study, a single-cell expression atlas of the pericarp of pitaya (Hylocereus undatus) is constructed, revealing that exocarp and mesocarp cells undergo the most significant changes during fruit senescence. Pseudotime analysis establishes cellular differentiation and gene expression trajectories during senescence. Early-stage oxidative stress imbalance is followed by the activation of resistance in exocarp cells; subsequently, senescence-associated proteins accumulate in the mesocarp cells at late-stage senescence. The central role of the early response factor HuCMB1 in the senescence regulatory network is unveiled. This study provides a spatiotemporal perspective for a deeper understanding of the dynamic senescence process in plants.


Subject(s)
Cactaceae , Fruit , Fruit/genetics , Proteins/genetics , Cactaceae/genetics , Sequence Analysis, RNA
2.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600664

ABSTRACT

Small open reading frames (smORFs) are acknowledged to play various roles in essential biological pathways and to affect human health in conditions ranging from diabetes to tumorigenesis. Predicting smORFs in silico is a prerequisite for processing omics data. Here, we propose the smORF-coding-potential-predicting framework, sOCP, which provides functions to construct a model for predicting novel smORFs in a species of interest. The sOCP model constructed for human was based on in-frame features and the nucleotide bias around the start codon; this small feature subset proved sufficient and avoided the overfitting problems of more complicated models. It showed better prediction metrics than previous methods and correlated closely with experimental evidence in a heterogeneous dataset. The model was applied to Rattus norvegicus and exhibited satisfactory performance. We then scanned smORFs with ATG and non-ATG start codons from the human genome and generated a database containing about a million novel smORFs with coding potential. Around 72 000 smORFs are located in the lncRNA regions of the genome. The smORF-encoded peptides may be involved in biological pathways rare for canonical proteins, including the glucocorticoid catabolic process and the prokaryotic defense system. Our work provides a model and database for human smORF investigation and a convenient tool for further smORF prediction in other species.
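
The genome scan for smORFs with ATG and non-ATG start codons can be pictured with a minimal sketch. This is not the sOCP implementation: the function name `find_smorfs`, the forward-strand-only scan, and the length cutoffs (`min_codons`, `max_codons`) are assumptions chosen for illustration, and no coding-potential model is applied.

```python
# Hypothetical sketch, not the sOCP code: list small ORFs on the forward strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorfs(seq, start_codons=("ATG",), min_codons=10, max_codons=100):
    """Return (start, end, orf) tuples for small ORFs in the three forward frames.

    Codon counts include the start codon and exclude the stop codon.
    The cutoffs are illustrative assumptions, not values from the abstract.
    Only the first start codon upstream of each stop is reported.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] in start_codons:
                # Extend codon by codon until an in-frame stop or the sequence end.
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                if j + 3 <= len(seq):  # an in-frame stop codon was found
                    n_codons = (j - i) // 3
                    if min_codons <= n_codons <= max_codons:
                        hits.append((i, j + 3, seq[i:j + 3]))
                    i = j + 3  # resume scanning after the stop codon
                    continue
            i += 3
    return hits


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 12 + "TAA" + "GG"
    for start, end, orf in find_smorfs(toy):
        print(start, end, orf)
```

Passing `start_codons=("ATG", "CTG", "GTG")` would include some non-ATG starts; which non-ATG codons the authors considered is not stated in the abstract.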


Subject(s)
Genome, Human , Peptides , Animals , Humans , Rats , Open Reading Frames , Peptides/genetics , Proteins/genetics
3.
Biomed Res Int ; 2024: 2501086, 2024.
Article in English | MEDLINE | ID: mdl-38659607

ABSTRACT

Purpose: Recurrent miscarriage (RM) is a significant reproductive concern affecting numerous women globally. Genetic factors are believed to play a crucial role in RM, making the histidine-rich glycoprotein (HRG) gene a topic of interest due to its potential involvement in angiogenesis. This study aimed to investigate the association between the HRG rs10770 genotype and RM. Method: Blood samples were collected from a total of 200 women at the beginning of the study. Subsequently, a comparative analysis was conducted between the blood samples of 100 women with a history of RM (case group) and those of 100 healthy women (control group). HRG rs10770 genotyping was performed by polymerase chain reaction restriction-fragment length polymorphism (PCR-RFLP), followed by statistical analysis to evaluate the relationship between HRG rs10770 genotype and RM. Results: The C/C genotype differed significantly between the case and control groups (OR = 3.32, CI: 1.22-9.04, p = 0.01), whereas the C/T genotype did not (OR = 1.24, CI: 0.67-2.30, p = 0.47). Additionally, the C allele frequency was significantly associated with RM when participants were compared with the control group (OR = 1.65, CI: 1.06-2.58, p = 0.02). Conclusion: The study highlights the importance of HRG rs10770 in understanding RM, shedding light on its implications for reproductive health. Furthermore, women carrying the homozygous C/C genotype exhibited increased susceptibility to RM.
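
For readers who want to see how an odds ratio and its confidence interval of the kind reported above are obtained from a 2x2 case-control table, here is a minimal sketch. The abstract does not give genotype counts or name the interval method, so the counts in the usage example are invented and the Woolf (log odds ratio) interval is an assumption; `odds_ratio_ci` is a hypothetical helper name.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-based) confidence interval.

    a: cases with the genotype      b: cases without it
    c: controls with the genotype   d: controls without it
    z = 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Invented counts for illustration only; they are not the study's data.
    or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
    print(f"OR = {or_:.2f}, 95% CI: {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1 corresponds to a significant association at roughly the 5% level, which matches the pattern in the abstract: the C/C interval (1.22-9.04) excludes 1, while the C/T interval (0.67-2.30) does not.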


Subject(s)
Abortion, Habitual , Gene Frequency , Genetic Predisposition to Disease , Genotype , Polymorphism, Single Nucleotide , Proteins , Humans , Female , Abortion, Habitual/genetics , Adult , Iran , Pregnancy , Polymorphism, Single Nucleotide/genetics , Gene Frequency/genetics , Proteins/genetics , Case-Control Studies , Genetic Association Studies , Alleles
4.
ACS Synth Biol ; 13(4): 1085-1092, 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38568188

ABSTRACT

Computational protein sequence design has the ambitious goal of modifying existing or creating new proteins; however, designing stable and functional proteins is challenging without predictability of protein dynamics and allostery. Informing protein design methods with evolutionary information limits the mutational space to more native-like sequences and results in increased stability while maintaining functions. Recently, language models, trained on millions of protein sequences, have shown impressive performance in predicting the effects of mutations. Assessing Rosetta-designed sequences with a language model showed scores that were worse than those of their original sequence. To inform Rosetta design protocols with language model predictions, we added a new metric to restrain the energy function during design using the Evolutionary Scale Modeling (ESM) model. The resulting sequences have better language model scores and similar sequence recovery, with only a minor decrease in the fitness as assessed by Rosetta energy. In conclusion, our work combines the strength of recent machine learning approaches with the Rosetta protein design toolbox.


Subject(s)
Proteins , Proteins/genetics , Amino Acid Sequence
5.
Genome Biol Evol ; 16(4)2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38597156

ABSTRACT

De novo genes emerge from previously noncoding stretches of the genome. Their encoded de novo proteins are generally expected to be similar to random sequences and, accordingly, to have no stable tertiary fold and high predicted disorder. However, structural properties of de novo proteins and whether they differ during the stages of emergence and fixation have not been studied in depth and rely heavily on predictions. Here we generated a library of short human putative de novo proteins of varying lengths and ages and sorted the candidates according to their structural compactness and disorder propensity. Using Förster resonance energy transfer combined with fluorescence-activated cell sorting, we were able to screen the library for the most compact protein structures, as well as the most elongated and flexible structures. We find that compact de novo proteins are on average slightly shorter and contain lower predicted disorder than less compact ones. The predicted structures for the most and least compact de novo proteins correspond to expectations in that they contain more secondary structure content or higher disorder content, respectively. Our experiments indicate that older de novo proteins have higher compactness and structural propensity compared with young ones. We discuss possible evolutionary scenarios and their implications underlying the age-dependencies of compactness and structural content of putative de novo proteins.


Subject(s)
Protein Folding , Proteins , Humans , Proteins/genetics , Protein Structure, Secondary , Gene Library
6.
PLoS One ; 19(4): e0301871, 2024.
Article in English | MEDLINE | ID: mdl-38593165

ABSTRACT

Genome sequencing has revealed an incredible diversity of bacteria and archaea, but there are no fast and convenient tools for browsing across these genomes. It is cumbersome to view the prevalence of homologs for a protein of interest, or the gene neighborhoods of those homologs, across the diversity of the prokaryotes. We developed a web-based tool, fast.genomics, that uses two strategies to support fast browsing across the diversity of prokaryotes. First, the database of genomes is split up. The main database contains one representative from each of the 6,377 genera that have a high-quality genome, and additional databases for each taxonomic order contain up to 10 representatives of each species. Second, homologs of proteins of interest are identified quickly by using accelerated searches, usually in a few seconds. Once homologs are identified, fast.genomics can quickly show their prevalence across taxa, view their neighboring genes, or compare the prevalence of two different proteins. Fast.genomics is available at https://fast.genomics.lbl.gov.


Subject(s)
Archaea , Bacteria , Archaea/genetics , Bacteria/genetics , Genomics , Proteins/genetics , Chromosome Mapping
7.
BMJ Case Rep ; 17(4)2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38631813

ABSTRACT

A man in his 30s was referred to neurology with right-sided paraesthesia, tremors, chest pain, and lower urinary tract and erectile dysfunction. He had a medical history of left acetabular dysplasia and subjective memory impairment, the latter in the context of depression and chronic pain with opioid use. There was no notable family history. On examination, he had a spastic paraparesis. Imaging revealed atrophy of the thoracic spine. Lumbar puncture demonstrated a raised protein but other constituents were normal, with no oligoclonal bands present. Genetic testing revealed a novel heterozygous likely pathogenic SPAST variant c.1643A>T p.(Asp548Val), confirming the diagnosis of hereditary spastic paraparesis. Symptomatic treatment with physiotherapy and antispasmodic therapy was initiated. This is the first study reporting a patient with this SPAST variant. Ensembl variant effect predictor was used, with the application of computational variant prediction tools providing support that the variant we have identified is likely deleterious and damaging. The CADD score of our variant was high, indicating a highly deleterious substitution.


Subject(s)
Paraparesis, Spastic , Spastic Paraplegia, Hereditary , Male , Humans , Paraparesis, Spastic/genetics , Spastic Paraplegia, Hereditary/genetics , Pedigree , Proteins/genetics , Genetic Testing , Mutation , Spastin/genetics
8.
Protein Sci ; 33(5): e4971, 2024 May.
Article in English | MEDLINE | ID: mdl-38591647

ABSTRACT

As protein crystals are increasingly finding diverse applications as scaffolds, controlled crystal polymorphism presents a facile strategy to form crystalline assemblies with controllable porosity with minimal to no protein engineering. Polymorphs of consensus tetratricopeptide repeat proteins with varying porosity were obtained through co-crystallization with metal salts, exploiting the innate metal ion geometric requirements. A single structurally exposed negative amino acid cluster was responsible for metal coordination, despite the abundance of negatively charged residues. Density functional theory calculations showed that while most of the crystals were the most thermodynamically stable assemblies, some were kinetically trapped states. Thus, crystalline porosity diversity is achieved and controlled with metal coordination, opening a new scope in the application of proteins as biocompatible protein-metal-organic frameworks (POFs). In addition, metal-dependent polymorphic crystals allow direct comparison of metal coordination preferences.


Subject(s)
Metal-Organic Frameworks , Proteins , Proteins/genetics , Proteins/chemistry , Metals/chemistry , Crystallization
9.
Sci Rep ; 14(1): 8136, 2024 04 07.
Article in English | MEDLINE | ID: mdl-38584172

ABSTRACT

Computational approaches for predicting the pathogenicity of genetic variants have advanced in recent years. These methods enable researchers to determine the possible clinical impact of rare and novel variants. Historically, these prediction methods used hand-crafted features based on structural, evolutionary, or physicochemical properties of the variant. In this study we propose a novel framework that leverages the power of pre-trained protein language models to predict variant pathogenicity. We show that our approach, VariPred (Variant impact Predictor), outperforms current state-of-the-art methods by using an end-to-end model that only requires the protein sequence as input. Using one of the best-performing protein language models (ESM-1b), we establish a robust classifier that requires no calculation of structural features or multiple sequence alignments. We compare the performance of VariPred with other representative models including 3Cnet, Polyphen-2, REVEL, MetaLR, FATHMM and ESM variant. VariPred performs as well as, or in most cases better than, these other predictors on six variant impact prediction benchmarks, despite requiring only sequence data and no pre-processing of the data.


Subject(s)
Mutation, Missense , Proteins , Virulence , Proteins/genetics , Amino Acid Sequence , Computational Biology/methods
10.
Funct Integr Genomics ; 24(2): 45, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38429550

ABSTRACT

Gracilariaceae is a group of large marine red algae and the main source of agar, with important economic and ecological value. The codon usage patterns of chloroplast genomes in 36 species from Gracilariaceae show that GC content ranges from 0.284 to 0.335, the average GC3 ranges from 0.135 to 0.243 and the ENC value ranges from 35.098 to 42.327, which indicates that these genomes are AT-rich and prefer codons ending in A or T. Nc plot, PR2 plot, neutrality plot and correlation analyses indicate that these biases may be caused by multiple factors, such as natural selection and mutation pressure, but prolonged natural selection is the main driving force influencing codon usage preference. The cluster analysis and phylogenetic analysis show differing relationships among these species and indicate that codons with weak or unbiased preferences may also play an irreplaceable role in these species' evolution. In addition, we identified 26 common high-frequency codons and 8-18 optimal codons, all ending in A/U, in these 36 species. Our results will not only contribute to future transgenic work in Gracilariaceae species aimed at maximizing protein yield, but also lay a theoretical foundation for further exploring their systematic classification.
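
The two simplest statistics quoted above, overall GC content and GC content at third codon positions (GC3), can be computed as in the following sketch. It is a simplified illustration rather than the authors' pipeline: published GC3 values often exclude Met, Trp and stop codons, which this version does not, and ENC is not computed here. The function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_content(cds):
    """Fraction of G or C at the third position of each complete codon.

    Assumes `cds` is an in-frame coding sequence; a trailing partial
    codon contributes no third position and is therefore ignored.
    """
    third_positions = cds.upper()[2::3]
    return sum(base in "GC" for base in third_positions) / len(third_positions)


if __name__ == "__main__":
    toy_cds = "ATGGCAAAATTTTAA"  # codons: ATG GCA AAA TTT TAA
    print(gc_content(toy_cds), gc3_content(toy_cds))
```

A GC3 well below the overall GC content, as in the ranges reported in the abstract, is what a preference for A/T-ending codons looks like in these numbers.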


Subject(s)
Codon Usage , Genome, Chloroplast , Phylogeny , Codon/genetics , Proteins/genetics
11.
Methods Mol Biol ; 2760: 371-392, 2024.
Article in English | MEDLINE | ID: mdl-38468099

ABSTRACT

Genetic engineering has revolutionized our ability to manipulate DNA and engineer organisms for various applications. However, this approach can lead to genomic instability, which can result in unwanted effects such as toxicity, mutagenesis, and reduced productivity. To overcome these challenges, smart design of synthetic DNA has emerged as a promising solution. By taking into consideration the intricate relationships between gene expression and cellular metabolism, researchers can design synthetic constructs that minimize metabolic stress on the host cell, reduce mutagenesis, and increase protein yield. In this chapter, we summarize the main challenges of genomic instability in genetic engineering and address the dangers of unknowingly incorporating genomically unstable sequences in synthetic DNA. We also demonstrate the instability of those sequences by the fact that they are selected against conserved sequences in nature. We highlight the benefits of using ESO, a tool for the rational design of DNA for avoiding genetically unstable sequences, and also summarize the main principles and working parameters of the software that allow maximizing its benefits and impact.


Subject(s)
Genetic Engineering , Genomic Instability , Humans , DNA/genetics , Proteins/genetics
12.
Cell Mol Biol (Noisy-le-grand) ; 70(2): 24-29, 2024 02 29.
Article in English | MEDLINE | ID: mdl-38430045

ABSTRACT

Genetics plays a vital role in the development of coronary artery disease (CAD), with heritability estimated at approximately 50-60%. We therefore examined the relationship between CAD risk and the C12orf43/rs2258287 polymorphism in the Pakistani population. In this study, based on a genetic approach to dyslipidemia, a total of 200 subjects from southern Punjab were included. Biochemical parameters (total cholesterol, triglycerides, blood glucose, high-density lipoprotein, and low-density lipoprotein) were analyzed, and molecular analysis using an ARMS-PCR-based assay for the single-nucleotide polymorphism (SNP) C12orf43/rs2258287 was carried out to identify the genotype. Genotypes showed a substantial correlation with both family history and metabolic markers. Cholesterol, low-density lipoprotein cholesterol (LDL-C), triglyceride and blood glucose levels were significantly higher, and the high-density lipoprotein cholesterol (HDL-C) level was significantly lower (p<0.05), in cases than in controls. Age, pulse rate, diabetes, physical activity, smoking, family history, and dietary habits were also significantly associated (p<0.05) with CAD. The SNP C12orf43/rs2258287 also showed an association with CAD in the population of southern Punjab. Based on this study, it could be concluded that CAD is characterized by an unfavorable lipid profile in association with SNP C12orf43/rs2258287.


Subject(s)
Coronary Artery Disease , Proteins , Humans , Blood Glucose , Cholesterol , Cholesterol, LDL , Coronary Artery Disease/genetics , Genetic Predisposition to Disease , Lipoproteins, HDL , Polymorphism, Single Nucleotide/genetics , Risk Factors , Triglycerides , Proteins/genetics
13.
J Math Biol ; 88(5): 50, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38551701

ABSTRACT

Network alignment aims to uncover topologically similar regions in the protein-protein interaction (PPI) networks of two or more species under the assumption that topologically similar regions tend to perform similar functions. Although there exist a plethora of both network alignment algorithms and measures of topological similarity, currently no "gold standard" exists for evaluating how well either is able to uncover functionally similar regions. Here we propose a formal, mathematically and statistically rigorous method for evaluating the statistical significance of shared GO terms in a global, 1-to-1 alignment between two PPI networks. Given an alignment in which k aligned protein pairs share a particular GO term g, we use a combinatorial argument to precisely quantify the p-value of that alignment with respect to g compared to a random alignment. The p-value of the alignment with respect to all GO terms, including their inter-relationships, is approximated using the Empirical Brown's Method. We note that, just as with BLAST's p-values, this method is not designed to guide an alignment algorithm towards a solution; instead, just as with BLAST, an alignment is guided by a scoring matrix or function; the p-values herein are computed after the fact, providing independent feedback to the user on the biological quality of the alignment that was generated by optimizing the scoring function. Importantly, we demonstrate that among all GO-based measures of network alignments, ours is the only one that correlates with the precision of GO annotation predictions, paving the way for network alignment-based protein function prediction.
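
To make the single-term p-value concrete, the sketch below works through one simple null model. It assumes a uniformly random 1-to-1 alignment that maps every protein of the smaller network to a distinct protein of the larger one; under that assumption the number of aligned pairs in which both proteins carry GO term g follows a hypergeometric distribution. This is an illustrative derivation under stated assumptions and may differ in detail from the combinatorial argument in the paper; the function and parameter names are hypothetical.

```python
from math import comb


def shared_go_pvalue(n2, m1, m2, k):
    """P(at least k aligned pairs share GO term g) under a random alignment.

    n2: number of proteins in the larger network
    m1: proteins annotated with g in the smaller network
    m2: proteins annotated with g in the larger network
    k:  observed number of aligned pairs in which both proteins carry g

    The images of the m1 annotated proteins form a uniformly random
    m1-subset of the larger network, so the count of shared pairs is
    Hypergeometric(population=n2, successes=m2, draws=m1).
    """
    total = comb(n2, m1)
    tail = sum(
        comb(m2, x) * comb(n2 - m2, m1 - x)
        for x in range(k, min(m1, m2) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers for illustration only.
    print(shared_go_pvalue(n2=10, m1=3, m2=4, k=3))
```

Combining such per-term p-values across all GO terms is a separate step; the abstract states that this is approximated with the Empirical Brown's Method, which accounts for dependence between terms and is not sketched here.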


Subject(s)
Algorithms , Computational Biology , Gene Ontology , Computational Biology/methods , Sequence Alignment , Protein Interaction Maps , Proteins/genetics
14.
Genes Genomics ; 46(5): 601-611, 2024 May.
Article in English | MEDLINE | ID: mdl-38546934

ABSTRACT

Human advancements in agriculture, urbanization, and industrialization have led to various forms of environmental pollution, including heavy metal pollution. Insects, as highly adaptable organisms, can survive under various environmental stresses, which induce oxidative damage and impair antioxidant systems. To investigate the peroxidase (POX) family in Tenebrio molitor, we characterized two POXs, namely TmPOX-iso1 and TmPOX-iso2. The full-length cDNA sequences of TmPox-iso1 and TmPox-iso2 consisted of an open reading frame of 1815 bp encoding 605 amino acids and an open reading frame of 2229 bp encoding 743 amino acids, respectively. TmPOX-iso1 and TmPOX-iso2 homologs were found in five distinct insect orders. In the phylogenetic tree analysis, TmPOX-iso1 clustered with the predicted POX protein of T. castaneum, and TmPOX-iso2 clustered with the POX precursor protein of T. castaneum. During development, the highest expression level of TmPox-iso1 was observed in the pre-pupal stage, while the highest expression levels of TmPox-iso2 were observed in the pre-pupal and 4-day pupal stages. TmPox-iso1 was primarily expressed in the early and late larval gut, while TmPox-iso2 mRNA expression was higher in the fat bodies and Malpighian tubules. In response to cadmium chloride treatment, TmPox-iso1 expression increased at 3 hours and then declined until 24 hours, while in the zinc chloride-treated group, TmPox-iso1 expression peaked 24 hours after the treatment. Both treated groups showed increases in TmPox-iso2 expression 24 hours after the treatments.


Subject(s)
Tenebrio , Animals , Humans , Tenebrio/genetics , Peroxidases/genetics , Phylogeny , Proteins/genetics , Amino Acids/genetics
15.
Genes (Basel) ; 15(3)2024 Mar 10.
Article in English | MEDLINE | ID: mdl-38540408

ABSTRACT

The production of milk by dairy cows far exceeds the nutritional needs of the calf and is vital for the economical use of dairy cattle. High milk yield is a unique production trait that can be effectively enhanced through traditional selection methods. The process of lactation in cows serves as an excellent model for studying the biological aspects of lactation with the aim of exploring the mechanistic base of this complex trait at the cellular level. In this study, we analyzed the milk transcriptome at the single-cell level by conducting scRNA-seq analysis on milk samples from two Holstein Friesian cows at mid-lactation (75 and 93 days) using the 10× Chromium platform. Cells were pelleted and fat was removed from milk by centrifugation. The cell suspension from each cow was loaded on separate channels, resulting in the recovery of 9313 and 14,544 cells. Library samples were loaded onto two lanes of the NovaSeq 6000 (Illumina) instrument. After filtering at the cell and gene levels, a total of 7988 and 13,973 cells remained, respectively. We were able to reconstruct different cell types (milk-producing cells, progenitor cells, macrophages, monocytes, dendritic cells, T cells, B cells, mast cells, and neutrophils) in bovine milk. Our findings provide a valuable resource for identifying regulatory elements associated with various functions of the mammary gland such as lactation, tissue renewal, native immunity, protein and fat synthesis, and hormonal response.


Subject(s)
Milk , Transcriptome , Female , Animals , Cattle , Milk/metabolism , Transcriptome/genetics , Lactation/genetics , Proteins/genetics , Phenotype
16.
J Mol Evol ; 92(2): 153-168, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38485789

ABSTRACT

Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins which contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also impact the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, to eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian calculation based method. We observe that on short evolutionary timescales TAb evolution is significantly impacted by changes in LCR length, with insertions driving TAb down. In contrast, the observed data are best explained by indel rates in LCRs which are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.


Subject(s)
Evolution, Molecular , Proteins , Humans , Bayes Theorem , Proteins/genetics , Proteins/chemistry , Amino Acid Sequence , Protein Domains
17.
J Mol Evol ; 92(2): 181-206, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38502220

ABSTRACT

Ancestral sequence reconstruction (ASR) is a phylogenetic method widely used to analyze the properties of ancient biomolecules and to elucidate mechanisms of molecular evolution. Despite its increasingly widespread application, the accuracy of ASR is currently unknown, as it is generally impossible to compare resurrected proteins to the true ancestors. Which evolutionary models are best for ASR? How accurate are the resulting inferences? Here we answer these questions using a cross-validation method to reconstruct each extant sequence in an alignment with ASR methodology, a method we term "extant sequence reconstruction" (ESR). We thus can evaluate the accuracy of ASR methodology by comparing ESR reconstructions to the corresponding known true sequences. We find that a common measure of the quality of a reconstructed sequence, the average probability, is indeed a good estimate of the fraction of correct amino acids when the evolutionary model is accurate or overparameterized. However, the average probability is a poor measure for comparing reconstructions from different models, because, surprisingly, a more accurate phylogenetic model often results in reconstructions with lower probability. While better (more predictive) models may produce reconstructions with lower sequence identity to the true sequences, better models nevertheless produce reconstructions that are more biophysically similar to true ancestors. In addition, we find that a large fraction of sequences sampled from the reconstruction distribution may have fewer errors than the single most probable (SMP) sequence reconstruction, despite the fact that the SMP has the lowest expected error of all possible sequences. Our results emphasize the importance of model selection for ASR and the usefulness of sampling sequence reconstructions for analyzing ancestral protein properties. 
ESR is a powerful method for validating the evolutionary models used for ASR and can be applied in practice to any phylogenetic analysis of real biological sequences. Most significantly, ESR uses ASR methodology to provide a general method by which the biophysical properties of resurrected proteins can be compared to the properties of the true protein.
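
The quantities discussed above (the single most probable sequence, its average probability, the fraction of correct residues, and sequences sampled from the reconstruction distribution) can be illustrated with a small sketch. It assumes the reconstruction is available as a list of per-site posterior distributions; the toy numbers and all function names are hypothetical and do not come from the paper.

```python
import random


def smp_sequence(posterior):
    """Single most probable (SMP) sequence: the top residue at every site."""
    return "".join(max(site, key=site.get) for site in posterior)


def average_probability(posterior, seq):
    """Mean posterior probability of the residues in `seq`.

    This is also the expected fraction of correct residues if the
    posterior probabilities are well calibrated.
    """
    return sum(site.get(aa, 0.0) for site, aa in zip(posterior, seq)) / len(seq)


def sample_sequence(posterior, rng):
    """Draw one sequence, sampling each site independently from its posterior."""
    return "".join(
        rng.choices(list(site), weights=list(site.values()))[0]
        for site in posterior
    )


def fraction_correct(seq, true_seq):
    """Fraction of sites where the reconstruction matches the known sequence."""
    return sum(a == b for a, b in zip(seq, true_seq)) / len(true_seq)


if __name__ == "__main__":
    # Toy posterior for a three-residue sequence (illustrative values only).
    posterior = [
        {"A": 0.9, "G": 0.1},
        {"L": 0.6, "I": 0.4},
        {"K": 0.5, "R": 0.5},
    ]
    smp = smp_sequence(posterior)
    print(smp, average_probability(posterior, smp))
    print(sample_sequence(posterior, random.Random(0)))
    print(fraction_correct(smp, "ALR"))
```

In the cross-validation setting described in the abstract, `true_seq` would be the held-out extant sequence, so `fraction_correct` can be compared directly against `average_probability` for each evolutionary model.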


Subject(s)
Biological Evolution , Proteins , Phylogeny , Proteins/genetics , Proteins/chemistry , Evolution, Molecular , Amino Acids
18.
Methods Mol Biol ; 2793: 65-82, 2024.
Article in English | MEDLINE | ID: mdl-38526724

ABSTRACT

Protein-protein interaction is at the heart of most biological processes, and small peptides that bind to protein binding sites are resourceful tools to explore and understand the structural requirements for these interactions. In that sense, phage display is a well-suited technology to study protein-protein interactions, as it allows for unbiased screening of billions of peptides in search of those that interact with a protein binding domain. Here, we will illustrate how two distinct but complementary approaches, phage display and nuclear magnetic resonance (NMR), can be utilized to unveil structural details of peptide-protein interaction. Finally, knowledge derived from phage mutagenesis and NMR studies can be streamlined for quick peptidomimetic design and synthesis using the retroinversion approach, in order to validate, with in vitro and in vivo assays, the therapeutic potential of peptides identified by phage display.


Subject(s)
Peptidomimetics , Peptide Library , Peptides/chemistry , Proteins/genetics , Cell Surface Display Techniques
19.
NPJ Syst Biol Appl ; 10(1): 29, 2024 Mar 15.
Article in English | MEDLINE | ID: mdl-38491038

ABSTRACT

Understanding the biological functions of proteins is of fundamental importance in modern biology. To represent a function of proteins, Gene Ontology (GO), a controlled vocabulary, is frequently used, because it is easy to handle by computer programs avoiding open-ended text interpretation. Particularly, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describe a protein function can pose challenges for biologists when it comes to interpretation. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.


Subject(s)
Proteins , Software , Humans , Gene Ontology , Proteins/genetics
20.
Sci Rep ; 14(1): 6009, 2024 03 12.
Article in English | MEDLINE | ID: mdl-38472223

ABSTRACT

Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates-thus reducing false positives as well as computation time.


Subject(s)
Algorithms , Proteins , Sequence Alignment , Proteins/genetics , Biological Evolution , Phylogeny , Computational Biology/methods